In [ ]:
import numpy as np
%pylab inline
In [ ]:
# Load the data
X = np.loadtxt('data/small_Endometrium_Uterus.csv', delimiter=',',
               skiprows=1, usecols=range(1, 3001))

# Python 2.7 only
y = np.loadtxt('data/small_Endometrium_Uterus.csv', delimiter=',',
               skiprows=1, usecols=[3001],
               converters={3001: lambda s: 0 if s == 'Endometrium' else 1},
               dtype='int')

# Python 3 alternative:
#y = np.loadtxt('data/small_Endometrium_Uterus.csv', delimiter=',',
#               skiprows=1, usecols=[3001], dtype='bytes').astype('str')
# Convert 'Endometrium' to 0 and 'Uterus' to 1
#y = np.where(y == 'Endometrium', 0, 1)
In [ ]:
# Set up a stratified 10-fold cross-validation
from sklearn import cross_validation
folds = cross_validation.StratifiedKFold(y, 10, shuffle=True)
print(folds)
Question: Create a cross-validation function that takes a design matrix, label array, scikit-learn classifier, and scikit-learn cross_validation object and returns the corresponding list of cross-validated predictions.
Make sure that you are returning the predictions in the correct order!
Check the documentation of fit(X, y) and predict_proba(X) in sklearn.linear_model.LogisticRegression: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
In [ ]:
def cross_validate(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and return the predictions.

    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    classifier: sklearn classifier object
        Classifier instance; must have the following methods:
        - fit(X, y) to train the classifier on the data X, y
        - predict_proba(X) to apply the trained classifier to the data X
          and return probability estimates
    cv_folds: sklearn cross-validation object
        Cross-validation iterator.

    Return:
    -------
    pred: (n_samples, ) np.array
        Vector of predictions (same order as labels).
    """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        pass  # TODO
    return pred
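One possible way to fill in the TODO (a sketch; it assumes, as is the case for LogisticRegression on 0/1 labels, that predict_proba returns one column per class with the positive class in column 1):
In [ ]:
def cross_validate(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and return the predictions. """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        # Train on the training fold only
        classifier.fit(design_matrix[tr, :], labels[tr])
        # Predict the probability of the positive class on the test fold;
        # assigning at indices te keeps the predictions in the same order
        # as the labels
        pred[te] = classifier.predict_proba(design_matrix[te, :])[:, 1]
    return pred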
Now use this function to compute cross-validated predictions on the data.
In [ ]:
from sklearn import linear_model
clf = linear_model.LogisticRegression(C=1e6) # a very high C means (almost) no regularization (we'll talk about regularization next week!)
ypred_logreg = cross_validate(X, y, clf, folds)
Question: Complete the code below to compute the cross-validated accuracy and area under the curve of the logistic regression on our data. Plot the ROC curve.
In [ ]:
from sklearn import metrics
fpr_logreg, tpr_logreg, thresholds = metrics.roc_curve(y, ypred_logreg, pos_label=1)
print("Accuracy:", )  # TODO
auc_logreg = metrics.auc(fpr_logreg, tpr_logreg)
plt.plot(  # TODO
)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(loc="lower right")
#plt.savefig('%s/evu_linreg.pdf' % fig_dir, bbox_inches='tight')
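For reference, one way to complete this cell (a sketch: accuracy is computed by thresholding the predicted probabilities at 0.5, and the ROC curve is labeled with its AUC):
In [ ]:
from sklearn import metrics
fpr_logreg, tpr_logreg, thresholds = metrics.roc_curve(y, ypred_logreg, pos_label=1)
print("Accuracy:", metrics.accuracy_score(y, (ypred_logreg > 0.5).astype(int)))
auc_logreg = metrics.auc(fpr_logreg, tpr_logreg)
plt.plot(fpr_logreg, tpr_logreg, color='blue',
         label='Logistic regression (AUC = %.3f)' % auc_logreg)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curve: Logistic regression', fontsize=16)
plt.legend(loc="lower right")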
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g., Gaussian with zero mean and unit variance). If a feature has a variance that is orders of magnitude larger than the others, it might dominate the objective function and prevent the estimator from learning from the other features correctly.
In practice we often ignore the shape of the distribution and simply transform the data to center it, by removing the mean value of each feature, and then scale it, by dividing non-constant features by their standard deviation.
Scikit-learn offers tools to deal with this issue.
In [ ]:
from sklearn import preprocessing
X_scaled = preprocessing.scale(X)
Let us now visualize the distributions of two of the features of the data, before and after scaling.
In [ ]:
idx_1 = 0 # first feature
fig = plt.figure(figsize=(12, 8)) # (width, height)
fig.add_subplot(221) # 2 x 2 grid, 1st subplot
h = plt.hist(X[:, idx_1], bins=30, color='blue')
plt.title('Feature %d (not scaled)' % idx_1, fontsize=16)
fig.add_subplot(222) # 2 x 2 grid, 2nd subplot
h = plt.hist(X_scaled[:, idx_1], bins=30, color='orange')
plt.title('Feature %d (scaled)' % idx_1, fontsize=16)
idx_2 = 1 # second feature
fig.add_subplot(223) # 2 x 2 grid, 3rd subplot
h = plt.hist(X[:, idx_2], bins=30, color='blue')
plt.title('Feature %d (not scaled)' % idx_2, fontsize=16)
fig.add_subplot(224) # 2 x 2 grid, 4th subplot
h = plt.hist(X_scaled[:, idx_2], bins=30, color='orange')
plt.title('Feature %d (scaled)' % idx_2, fontsize=16)
plt.tight_layout() # improve spacing between subplots
Question: Compute the cross-validated predictions of the logistic regression on the scaled data.
In [ ]:
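One possible answer, reusing the same classifier and folds (the name ypred_logreg_scaled is our choice):
In [ ]:
ypred_logreg_scaled = cross_validate(X_scaled, y, clf, folds)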
Question: Plot the two ROC curves (one for the logistic regression on the original data, one for the logistic regression on the scaled data) on the same plot.
In [ ]:
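A sketch of the comparison, assuming ypred_logreg_scaled from the previous question:
In [ ]:
fpr_scaled, tpr_scaled, thresholds = metrics.roc_curve(y, ypred_logreg_scaled, pos_label=1)
auc_scaled = metrics.auc(fpr_scaled, tpr_scaled)
plt.plot(fpr_logreg, tpr_logreg, color='blue',
         label='Original data (AUC = %.3f)' % auc_logreg)
plt.plot(fpr_scaled, tpr_scaled, color='orange',
         label='Scaled data (AUC = %.3f)' % auc_scaled)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves: Logistic regression', fontsize=16)
plt.legend(loc="lower right")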
In a cross-validation setting, we ignore the samples from the test fold when training the classifier. This also means that the scaling parameters should be computed on the training data only.
In scikit-learn, we can use a scaler to center and scale each feature independently, computing the relevant statistics on the samples in the training set. The mean and standard deviation are stored so that the same transformation can later be applied to the test data.
In [ ]:
scaler = preprocessing.StandardScaler()
# Fit the scaler on the training data and transform it:
#Xtr = scaler.fit_transform(Xtr)
# Apply the same transformation (training mean and std) to the test data:
#Xte = scaler.transform(Xte)
Question: Rewrite the cross_validate function to include a scaling step.
In [ ]:
def cross_validate_with_scaling(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation and return the predictions.
    Use a scaler to scale the features to mean 0, standard deviation 1.

    Parameters:
    -----------
    design_matrix: (n_samples, n_features) np.array
        Design matrix for the experiment.
    labels: (n_samples, ) np.array
        Vector of labels.
    classifier: sklearn classifier object
        Classifier instance; must have the following methods:
        - fit(X, y) to train the classifier on the data X, y
        - predict_proba(X) to apply the trained classifier to the data X
          and return probability estimates
    cv_folds: sklearn cross-validation object
        Cross-validation iterator.

    Return:
    -------
    pred: (n_samples, ) np.array
        Vector of predictions (same order as labels).
    """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        pass  # TODO
    return pred
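One possible implementation (a sketch: a fresh StandardScaler is fit on each training fold and the same transformation is then applied to the corresponding test fold):
In [ ]:
def cross_validate_with_scaling(design_matrix, labels, classifier, cv_folds):
    """ Perform a cross-validation with per-fold scaling and return the predictions. """
    pred = np.zeros(labels.shape)
    for tr, te in cv_folds:
        # Fit the scaler on the training fold only...
        scaler = preprocessing.StandardScaler()
        Xtr = scaler.fit_transform(design_matrix[tr, :])
        # ...and apply the training-fold mean and std to the test fold
        Xte = scaler.transform(design_matrix[te, :])
        classifier.fit(Xtr, labels[tr])
        pred[te] = classifier.predict_proba(Xte)[:, 1]
    return pred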
Question: Now use the cross_validate_with_scaling function to cross-validate the logistic regression on our data.
In [ ]:
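For instance (the name ypred_logreg_cvscaled is our choice):
In [ ]:
ypred_logreg_cvscaled = cross_validate_with_scaling(X, y, clf, folds)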
Question: Again, compare the AUROC and ROC curves with those obtained previously. What do you conclude?
In [ ]:
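A sketch, following the same pattern as before (it assumes the predictions and curves computed in the previous questions):
In [ ]:
fpr_cvscaled, tpr_cvscaled, thresholds = metrics.roc_curve(y, ypred_logreg_cvscaled, pos_label=1)
auc_cvscaled = metrics.auc(fpr_cvscaled, tpr_cvscaled)
plt.plot(fpr_scaled, tpr_scaled, color='orange',
         label='Scaled on all data (AUC = %.3f)' % auc_scaled)
plt.plot(fpr_cvscaled, tpr_cvscaled, color='green',
         label='Scaled within CV (AUC = %.3f)' % auc_cvscaled)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves: Logistic regression', fontsize=16)
plt.legend(loc="lower right")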
Scikit-learn also provides preprocessing.normalize, which rescales each sample (each row) to have unit norm, unlike scale, which standardizes each feature (each column).
In [ ]:
X_norm = preprocessing.normalize(X)
Question: Repeat the previous analysis using the normalized data (and the same cross-validation folds), i.e. compare the cross-validated logistic regression on the original data and on the normalized data.
In [ ]:
In [ ]:
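A sketch of this comparison (variable names are our choice):
In [ ]:
ypred_logreg_norm = cross_validate(X_norm, y, clf, folds)
fpr_norm, tpr_norm, thresholds = metrics.roc_curve(y, ypred_logreg_norm, pos_label=1)
auc_norm = metrics.auc(fpr_norm, tpr_norm)
plt.plot(fpr_logreg, tpr_logreg, color='blue',
         label='Original data (AUC = %.3f)' % auc_logreg)
plt.plot(fpr_norm, tpr_norm, color='red',
         label='Normalized data (AUC = %.3f)' % auc_norm)
plt.xlabel('False Positive Rate', fontsize=16)
plt.ylabel('True Positive Rate', fontsize=16)
plt.title('ROC curves: Logistic regression', fontsize=16)
plt.legend(loc="lower right")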